fix(update_service): fixes TaskDefinition inactive exception #135
rokibulislaam wants to merge 3 commits into GetSimpl:master from
- Added a new task definition tag, `task_definition_source`, to identify whether the task definition was created with CloudFormation or with boto3.
- Included a preflight step to verify whether the existing deployment is dirty and, if it is, whether an image with the tag `master` is available on ECR.
One crucial case it doesn't cover is when someone tries to update only the properties which don't affect the TD during Need to cover this; what's your opinion, @praveenraghav01?
I'd suggest that we only do this if the changes include creation of a new service. For the rest, we can just ignore the image check.
Moved `update_service` preflight checks function to pre-flight.py
    repo_name = service_name + '-repo'
    res = ecr_client.batch_get_image(repositoryName=repo_name, imageIds=[{'imageTag': 'master'}])
    if res['images'] == []:
        raise UnrecoverableException("Current deployment is dirty. Please push an image tagged as 'master' to ECR.")
On further thought, we should just bail stating that the current deployment is dirty and that an update cannot be made.
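A minimal sketch of what "just bail" could look like, assuming the dirty state can be read off the tag of the image URI in the currently running task definition. The helper names `image_tag` and `ensure_deployment_is_clean` are hypothetical, and `UnrecoverableException` is redefined here only as a stand-in for cloudlift's own class so the snippet runs on its own.

```python
class UnrecoverableException(Exception):
    """Local stand-in for cloudlift's exception of the same name."""


def image_tag(image_uri):
    """Return the tag of an image URI like 'registry/repo:tag', or '' if untagged."""
    _, sep, tag = image_uri.rpartition(":")
    # A '/' after the last ':' means the colon belonged to a registry port, not a tag.
    if not sep or "/" in tag:
        return ""
    return tag


def ensure_deployment_is_clean(image_uri):
    """Hypothetical preflight: refuse to update when the deployed image is dirty."""
    if "dirty" in image_tag(image_uri):
        raise UnrecoverableException(
            "Current deployment is dirty. An update cannot be made."
        )
```

With this shape the preflight needs no ECR lookup at all; it only inspects the image URI that is already deployed.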
    @@ -164,4 +169,4 @@ def _print_progress(self):
            if "FAIL" in final_status:
                log_err("Finished with status: %s" % (final_status))
Should emit a non-zero exit code
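One way this could be done is sketched below. The function names `exit_code_for` and `finish` are hypothetical, and `print` to stderr stands in for cloudlift's `log_err`; only the `"FAIL" in final_status` check is taken from the diff.

```python
import sys


def exit_code_for(final_status):
    """Map a final stack status string to a process exit code."""
    return 1 if "FAIL" in final_status else 0


def finish(final_status):
    """Report the final status and terminate with a matching exit code."""
    code = exit_code_for(final_status)
    # Failures go to stderr, mirroring what log_err is assumed to do.
    stream = sys.stderr if code else sys.stdout
    print("Finished with status: %s" % final_status, file=stream)
    sys.exit(code)
```

Keeping the status-to-code mapping in its own function makes it easy to unit test without terminating the test process.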
    repo_name = service_name + '-repo'
    res = ecr_client.batch_get_image(repositoryName=repo_name, imageIds=[{'imageTag': 'master'}])
    if res['images'] == []:
        raise UnrecoverableException("Current deployment is dirty. Please push an image tagged as 'master' to ECR.")
Need to add a cloudlift-specific command to push to ECR. This can be done with `upload_to_ecr`.
More context:
Whenever we execute the `create_service` command, a CloudFormation template is generated. Within this template, in the ECS task definition section, we set the default Docker image URL with the `master` tag if the current git status is `dirty` or if the work tree is not clean. Additionally, we keep the `DesiredCount` as zero since, at this stage, we don't have an image uploaded to ECR. If we were to proceed with deployment, it would fail. Consequently, at this point, CloudFormation creates a task definition with the revision `Family:1`.

On the other hand, when we execute the `deploy_service` command, we build a Docker image and upload it to ECR with a tag based on the current git status or a specific version provided using the `--version` option. If the work tree is dirty, we append the `dirty` tag to the Docker image and push it to ECR. Next, we manually create a new task definition using `boto3`, utilizing the Docker image URI we just deployed with the corresponding image tag. We then increase the desired count from 0 to 1, prompting ECS to start the task. At this stage, we create a new task definition `Family:2` and deactivate `Family:1`, both through `boto3`.

What's the issue?
The problem arises when we execute the `update_service` command, as we rely on CloudFormation to make the necessary changes. We regenerate the CloudFormation template and request CloudFormation to create a changeset. CloudFormation compares the old and new templates, resulting in the creation of a changeset. Now, two scenarios can occur:

- We make changes to the cloudlift service configuration that affect the task definition, such as updating `memory_reservation`, which modifies `memoryReservation` in the ECS task definition. In this scenario, CloudFormation creates a new task definition with a new revision, `Family:3`. However, in the new task definition, it sets the new image URI with the `master` tag. The CloudFormation stack update is successful, but since we had a dirty work tree, we only have one image deployed on ECR, with the `dirty` tag. As a result, ECS attempts to run the task but continually fails since the image with the `master` tag does not exist. This is where the preflight check becomes relevant.
- We make changes to the cloudlift configuration that do not affect the task definition, such as updating the ELB/ALB health check path with `health_check_path`, which alters the load balancer and target group. Once again, we create a changeset and execute it. However, CloudFormation attempts to reference the last task definition created by itself, which in our case is `Family:1`. We manually deactivated this task definition using `boto3`, so the CloudFormation stack update fails with a client exception: `TaskDefinition is inactive, Status Code: 400`. To address this, we are introducing a new tag (`task_definition_source`) on the task definition to identify whether the previous task definition was created by CloudFormation or `boto3`. If it was created by CloudFormation, we do not deregister the task definition.

A point to remember: CloudFormation always attempts to reference the task definition it created, regardless of the version number. Even if you manually created a new task definition, such as `Family:999`, and updated the service to run from `Family:999`, CloudFormation would still try to reference the version it created during its last operation. Could this be a bug in CloudFormation, or is it designed to behave this way?
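The tag-based guard described above could be sketched roughly as follows. This is not the PR's actual code: the function names are hypothetical, and the tag values `"boto3"` and `"cloudformation"` are assumptions, since only the tag key `task_definition_source` is given here. The two client calls, `describe_task_definition` with `include=["TAGS"]` and `deregister_task_definition`, are standard boto3 ECS operations.

```python
TAG_KEY = "task_definition_source"
BOTO3_SOURCE = "boto3"  # assumed tag value; the CloudFormation one is assumed to be "cloudformation"


def task_definition_source(ecs_client, task_definition_arn):
    """Return the value of the source tag, or None if the tag is absent."""
    response = ecs_client.describe_task_definition(
        taskDefinition=task_definition_arn,
        include=["TAGS"],  # tags are only returned when explicitly requested
    )
    tags = {tag["key"]: tag["value"] for tag in response.get("tags", [])}
    return tags.get(TAG_KEY)


def deregister_if_created_by_boto3(ecs_client, task_definition_arn):
    """Deregister only task definitions explicitly tagged as created via boto3.

    Untagged definitions are kept, on the assumption that they may be the
    revision CloudFormation still references.
    """
    if task_definition_source(ecs_client, task_definition_arn) != BOTO3_SOURCE:
        return False
    ecs_client.deregister_task_definition(taskDefinition=task_definition_arn)
    return True
```

In real use `ecs_client` would be `boto3.client("ecs")`. Passing the client in as a parameter keeps the guard testable with a stub and leaves CloudFormation-created revisions such as `Family:1` active.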